In 2020, the entire world was shaken by the global SARS-CoV-2 pandemic. The first case was reported in China in mid-December 2019. The animal and seafood market in Wuhan, China, is indicated as source of the disease. A lot of countries introduced a state of emergency to limit the spread of the virus. Unfortunately, due to many interstate connections, the connections, the virus spread very quickly to many continents.
The SARS-CoV-2 virus causes the COVID-19 disease, which the symptomps are very similar to those of the seasonal flu. The virus affects the respiratory organs, mainly the lungs.The disease is most dangerous for the elderly and people with so-called concomitant diseases (e.g. diabetes, lung diseases, cardiovascular diseases). Common symptoms of coronavirus infection are:The disease may lead to complications, e.g. pneumonia, acute respiratory distress syndrome.
To date (November 2020) there is no vaccine or effective antiviral drugs. Treatment is based on symptomatic treatment and supportive therapy (in the case of respiratory disorder). In order to counteract the spread of disease, frequent hand washing and surface disinfection are recommended.
The economy was significantly affected by the pandemic. Many countries have decided on implementing the so-called Lockdown - a temporary shutdown of the economy in order to avoid rallying people and infecting others. The education system also suffered - online learning was introduced in many schools and universities.
The following document is a coronavirus case study base on article [here article] published on May 14, 2020. The data concerns 375 patients from Wuhan region of China (Tongji Hospital). In the conducted analysis, the most discriminating biomarkers of patient mortality were identified using machine lerning tools. The problem was defined as a classification task, where the input data included blood samples and laboratory test results.
Data is import from flat file.
cov_cs_df <- read_excel("data\\wuhan_blood_sample_data_Jan_Feb_2020.xlsx")
Dataset has 6120 rows and 81 columns. It is too much to show them all. Below there is only few of columns:
kable(head(cov_cs_df[1:10], 5))
| PATIENT_ID | RE_DATE | age | gender | Admission time | Discharge time | outcome | Hypersensitive cardiac troponinI | hemoglobin | Serum chloride |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2020-01-31 01:09:00 | 73 | 1 | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | 0 | NA | NA | NA |
| NA | 2020-01-31 01:25:00 | 73 | 1 | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | 0 | NA | 136 | NA |
| NA | 2020-01-31 01:44:00 | 73 | 1 | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | 0 | NA | NA | 103.1 |
| NA | 2020-01-31 01:45:00 | 73 | 1 | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | 0 | NA | NA | NA |
| NA | 2020-01-31 01:56:00 | 73 | 1 | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | 0 | 19.9 | NA | NA |
That shows us, the dataset is illegible. Fisrt, we need to clean data to enhance readability of dataset.
data_df <- cov_cs_df %>%
mutate(gender = as.factor(ifelse(gender==1, "male", "female"))) %>%
mutate(outcome = as.factor(ifelse(outcome == 0, "survived", "died"))) %>%
rename(admission_time = 'Admission time',
discharge_time = 'Discharge time')
colnames(data_df)[34] <- "Tumor_necrosis_factor_alfa"
colnames(data_df)[37] <- "Interleukin_1_Beta"
colnames(data_df)[68] <- "gamma_glutamyl_transpeptidase"
patients_df <- data_df %>%
select(PATIENT_ID, age, gender, admission_time, discharge_time, outcome) %>%
drop_na(PATIENT_ID) %>%
mutate("hospitalization_length" = seconds_to_period(difftime(discharge_time,
admission_time,
units = "days" ))) %>%
relocate(hospitalization_length, .after = discharge_time)
data_df <- data_df %>%
fill(PATIENT_ID)
After replace NA values in PATIENT_ID, the dataset looks like below.
kable(head(data_df[1:8], 10))
| PATIENT_ID | RE_DATE | age | gender | admission_time | discharge_time | outcome | Hypersensitive cardiac troponinI |
|---|---|---|---|---|---|---|---|
| 1 | 2020-01-31 01:09:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
| 1 | 2020-01-31 01:25:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
| 1 | 2020-01-31 01:44:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
| 1 | 2020-01-31 01:45:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
| 1 | 2020-01-31 01:56:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | 19.9 |
| 1 | 2020-01-31 01:59:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
| 1 | 2020-01-31 02:09:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
| 1 | 2020-01-31 06:44:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
| 1 | 2020-02-04 19:42:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
| 1 | 2020-02-06 09:14:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | NA |
Still, we see, there is a NA values in others column. This requires replacing these values. It is possible to do it in many ways, but here, the NA values in each column will be replace by median of the values in column.
for(i in 8:ncol(data_df))
{
val <- data_df[i]
val[is.na(val)] <- median(val[!is.na(val)])
data_df[i] <- val
}
After that, the dataset is clean and looks like below.
kable(head(data_df[1:8], 10))
| PATIENT_ID | RE_DATE | age | gender | admission_time | discharge_time | outcome | Hypersensitive cardiac troponinI |
|---|---|---|---|---|---|---|---|
| 1 | 2020-01-31 01:09:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | 20.6 |
| 1 | 2020-01-31 01:25:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | 20.6 |
| 1 | 2020-01-31 01:44:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | 20.6 |
| 1 | 2020-01-31 01:45:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | 20.6 |
| 1 | 2020-01-31 01:56:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | 19.9 |
| 1 | 2020-01-31 01:59:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | 20.6 |
| 1 | 2020-01-31 02:09:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | 20.6 |
| 1 | 2020-01-31 06:44:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | 20.6 |
| 1 | 2020-02-04 19:42:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | 20.6 |
| 1 | 2020-02-06 09:14:00 | 73 | male | 2020-01-30 22:12:47 | 2020-02-17 12:40:09 | survived | 20.6 |
Below short summary of clean dataset.
#tbl_summary(
# data_df,
# by = outcome,
# label = gender ~ "Gender"
#) %>%
# add_n() %>%
# modify_header(label = "") %>%
# add_overall() %>%
# bold_labels()
With clean dataset, it is possible to start analyse dataset.
First, we can check number of patients divided into genders.
gender_bar_plot <- ggplot(patients_df, aes(x = gender, fill = gender)) +
geom_bar() +
theme_bw() +
labs(title = "Number of patients divided into gender",
x = "Gender",
y = "Number of patients",
fill = "Gender")
ggplotly(gender_bar_plot)
Below the chart showing ages of patients.
gender_age_hist <- ggplot(patients_df, aes(x = age, fill=gender)) +
geom_histogram(binwidth = 5.0) +
theme_bw() +
labs(title = "Histogram of patients age and genders.",
x = "Age",
y = "Number of patients",
fill = "Gender") +
scale_x_continuous(breaks = seq(0, max(patients_df$age), 5))
ggplotly(gender_age_hist)
It is possible to create two histograms for each gender.
gender_age_histograms <- ggplot(patients_df, aes(x = age, fill=gender)) +
geom_histogram(binwidth = 5.0) +
theme_bw() +
facet_wrap(~gender) +
labs(title = "Histogram of patients age and genders.",
x = "Age",
y = "Number of patients",
fill = "Gender") +
scale_x_continuous(breaks = seq(0, max(patients_df$age), 5))
ggplotly(gender_age_histograms)
Let show the histograms about outcome due to hospitalization length. First, histogram including each genders.
hosp_length_hist <- ggplot(patients_df, aes(x = hospitalization_length, fill = gender)) +
geom_histogram(binwidth = 1.0) +
theme_bw() +
scale_x_continuous(breaks=seq(0, max(patients_df$hospitalization_length), 1)) +
labs(title = "Number of patients and their hospitalization length",
x = "Hospitalization length (in days)",
y = "Number of patients",
fill= "Gender")
ggplotly(hosp_length_hist)
Next chart will be the histogram including information about did the patient survived or died.
hosp_length_outcome_hist <- ggplot(patients_df, aes(x = hospitalization_length,
fill = outcome)) +
geom_histogram(binwidth = 1.0) +
theme_bw() +
scale_x_continuous(breaks = seq(0, max(patients_df$hospitalization_length), 1)) +
labs(title = "Number of patients, their hospitalization length and outcome (survived or died)",
x = "Hospitalization length (in days)",
y = "Number of patients",
fill = "Outcome")
ggplotly(hosp_length_outcome_hist)
Next, the histograms including every information above, but divided by genders and outcomes.
hosp_length_outcomes_hists <- ggplot(patients_df, aes(x = hospitalization_length,
fill = outcome)) +
geom_histogram(binwidth = 1.0) +
theme_bw() +
facet_grid(gender~outcome) +
scale_x_continuous(breaks = seq(0, max(patients_df$hospitalization_length), 5)) +
scale_y_continuous(breaks = seq(0, 20, 4)) +
labs(title = "Number of patients, their hospitalization length and outcome (survived or died)",
x = "Hospitalization length (in days)",
y = "Number of patients",
fill = "Outcome")
ggplotly(hosp_length_outcomes_hists)
Following charts will be animated. Next will show the number of patients death in next days
patients_death <- patients_df %>%
select(discharge_time, outcome) %>%
filter(outcome == "died") %>%
mutate(discharge_time = as.integer(difftime(discharge_time,
min(patients_df$discharge_time),
units="days"))) %>%
group_by(discharge_time) %>%
summarise(num_of_deaths = n(), .groups="drop")
animated_deaths_plot <- ggplot(patients_death, aes(x = discharge_time,
y = num_of_deaths)) +
geom_line(color = "blue",
size=1.5) +
theme_bw() +
labs(title = "Number of patients death in next days (from addition day)",
x = "Days after addition day",
y = "Number of deaths") +
scale_x_continuous(breaks = seq(0, max(patients_death$discharge_time), 2)) +
scale_y_continuous(breaks = seq(0, max(patients_death$num_of_deaths), 1)) +
transition_reveal(discharge_time)
anim_save("deaths.gif", animated_deaths_plot)
The chart below will show the aggregate number of deaths in next days.
patients_death <- patients_death %>%
arrange(discharge_time) %>%
mutate(num_of_deaths_agg = cumsum(num_of_deaths))
animated_deaths_agg_plot <- ggplot(patients_death,
aes(x = discharge_time,
y = num_of_deaths_agg)) +
geom_line(color = "blue",
size = 1.5) +
theme_bw() +
labs(title = "Number of aggregate patients death in next days (from addition day)",
x = "Days after addition day",
y = "Number of deaths (aggregate)") +
scale_x_continuous(breaks = seq(0, max(patients_death$discharge_time), 2)) +
scale_y_continuous(breaks = seq(0, max(patients_death$num_of_deaths_agg), 10)) +
transition_reveal(discharge_time)
anim_save("deaths_agg.gif", animated_deaths_agg_plot)
ml_df <- data_df %>%
select(-c("PATIENT_ID", "RE_DATE")) %>%
mutate(hosp_len = seconds_to_period(difftime(discharge_time,
admission_time,
units = "days" ))) %>%
relocate(hosp_len, .after = gender) %>%
select(-c("admission_time", "discharge_time"))
inTraining <- createDataPartition(y = ml_df$outcome, p=.70, list=FALSE)
training <- ml_df[inTraining,]
## Warning: The `i` argument of ``[`()` can't be a matrix as of tibble 3.0.0.
## Convert to a vector.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
testing <- ml_df[-inTraining,]
ctrl <- trainControl(method="repeatedcv", number = 2, repeats = 5)
set.seed(23)
fit <- train(outcome ~ ., data = training, method = "rf", trControl = ctrl, ntree=10)
rfClasses <- predict(fit, newdata = testing)
confusionMatrix(data =rfClasses, testing$outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction died survived
## died 860 23
## survived 11 941
##
## Accuracy : 0.9815
## 95% CI : (0.9742, 0.9871)
## No Information Rate : 0.5253
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9629
##
## Mcnemar's Test P-Value : 0.05923
##
## Sensitivity : 0.9874
## Specificity : 0.9761
## Pos Pred Value : 0.9740
## Neg Pred Value : 0.9884
## Prevalence : 0.4747
## Detection Rate : 0.4687
## Detection Prevalence : 0.4812
## Balanced Accuracy : 0.9818
##
## 'Positive' Class : died
##